Outline

  • 13.30-14.30: Talk
  • 14.30-15.00: Break
  • 15.00-16.00: Hands-on: Create your own visualisation

Disclaimer

  • Some ideas and materials used here are from the MASc in Data Visualization taught at the Centre for Interdisciplinary Methodologies at the University of Warwick. Thanks to their authors: Greg McInerny, Cagatay Turkay, Timothy Monteath.

  • Other ideas are from the DNG Research Fund 2024 project “Can Digital Goods be Neutral?”. Thanks to all those involved: Timothy Monteath, Selene Yang, Silvia Ribera Alfaro, Alejandra Canclini.

  • This is not a course on how to do (good) data visualisations (there are entire degrees for that!)

  • It is impossible to provide training on data vis in 2h.

    • Plenty of training and videos that promise tools or tricks about data vis

Setting the expectations

The usual approach

  • Tools: Tableau, Power BI, Javascript libraries, Observable, Python, R…

  • Chart types: sunburst, sankey, spider, violin, lollipop…

  • Recipes / Tricks for specific tasks: sorting, adding interactivity, compositing images…

Radial histogram. Source: https://www.datylon.com/blog/types-of-charts-graphs-examples-data-visualization#radial-histogram

Yes

They are all important:

  • Tools define what we can and can’t do, and how we do it (consider affordances, workflows, price, features…)

  • Choosing the right chart type is not trivial, and it plays a key role in succeeding or failing at what we want to do with data vis.

  • Recipes / Tricks are ideal to quickly learn specific, frequent tasks.

…But

  • They are usually subject to trends or commercial interests (Newer != better)

    • New chart types and software are constantly created (also, see Wickham (2010))
  • Too specific, difficult to generalise from or start from

  • Data visualisations are way more than a series of technical choices.

Florence Nightingale created the famous (and still trendy!) Coxcomb diagram in 1858!

Aims

  • Critically understand data visualisations

    • how we can use them in research
  • Provide criteria to get started

Do not expect recipes, and expect to leave with more questions than when you started.

…but do expect to learn and enjoy this introductory session.

What are data visualisations?

Let’s see some examples. Are these data visualisations?

Scatterplots are a very popular and effective visualisation (even as basic as this one!) to detect correlations.

Interesting use of legends. Source: unknown, seen on social media.

Imago Mundi Babylonian map, the oldest known world map, 6th century BC Babylonia. Now in the British Museum.

Source: https://errantscience.com

Source: Abela (2006)

An infographic by Guerrilla Girls (1989), as seen in d’Ignazio and Klein (2023)

Confusing?

Source: How I Met Your Mother (2005-2014)

Let’s see what we’ve learnt:

  • We may have an understanding of what Data Visualisations are, but defining them is not always easy.

  • What defines a data visualisation are not types (scatterplots, pie, bars…)

    • Actually, we’ve seen visualisations that do not fit in a particular type of chart (see Grammar of Graphics (Wickham 2010))
  • …nor the tools used to generate them (e.g. Babylon map)

  • …nor their visual qualities or complexity (e.g. scatterplot)

  • …nor their accuracy, even!

  • They all have something in common… data!

    • Mapping data to specific features (colours, shapes, position…)

A note on data

There’s a constant in every definition: Data.

  • Data vis needs data.

  • Data can be of different types -> Defines what we can/can’t do with it

    • format

    • structure

  • Data manipulation comes hand in hand with data visualisation

    • Filtering

    • Reshaping

    • Modifying
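As a minimal sketch of these three operations in base R, assuming a made-up `cities` data frame (all names and values below are hypothetical and do not come from any dataset used in this session):

```r
# Hypothetical data: population (in thousands) of three made-up cities
cities <- data.frame(
  city = c("A", "B", "C"),
  pop_2020 = c(120, 45, 300),
  pop_2023 = c(130, 44, 320)
)

# Filtering: keep only the rows that meet a condition
large_cities <- subset(cities, pop_2023 > 100)

# Modifying: derive a new column from existing ones
cities$growth_pct <- 100 * (cities$pop_2023 - cities$pop_2020) / cities$pop_2020

# Reshaping: go from one column per year (wide) to one row per city and year (long)
cities_long <- reshape(
  cities[, c("city", "pop_2020", "pop_2023")],
  direction = "long",
  varying = c("pop_2020", "pop_2023"),
  v.names = "population",
  timevar = "year",
  times = c(2020, 2023),
  idvar = "city"
)

print(large_cities)
print(cities_long)
```

The long format is usually the one plotting libraries expect, because each variable we want to map to a visual feature (here, year and population) gets its own column.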

Source: Munzner (2015)

Good data vis != good data

We can create great data visualisations with totally wrong data! Source: Tyler Vigen’s Spurious Correlations

Definitions

It is commonly stated that a picture tells a thousand words. As such, the visual register has long been used to summarise and describe datasets through statistical charts and graphs, diagrams, spatialisations, maps, and animations. These visual methods effectively reveal and communicate the structure, pattern and trends of variables and their interconnections. Given the enormous volumes and velocity of big data, it is then no surprise that visualisation has proven a popular way for making sense of data and communicating that sense. (Kitchin 2014, chap. 6)

Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively. (Munzner 2015, 1)

Visualizations are often described by the constraints from which they were designed:

  • the resources needed (data, media…)

  • the task to be enabled (to locate, compare, reflect)

  • the context of use (e.g. users, situation, device).

(McInerny 2018)

Uses

Communicating information

This usually comes at the end of the process. Here we use data vis to convey clear, memorable messages from complex data.

Source: https://graphics.wsj.com/infectious-diseases-and-vaccines/

Data visualisations about Climate change, by Mark Poynting, Erwan Rivault, Becky Dale. Source: BBC

Histogram of new Mastodon accounts between April 2022 and February 2023. Source: Cámara-Menoyo and Tkacz (2023).

Greenhouse emissions. Source: UN climate report

Source: The Economist

Believe it or not, these two charts show the same data but reflect different design decisions. Can you spot the differences?

Hans Rosling’s famous presentation skills include the use of good data visualisations, animations, pauses, facts… to effectively communicate about data.

This interactive dashboard encourages users to focus on one area – a country, region, or income level – and see how it compares to its counterparts across a wide range of metrics. Author: Lindsey Poulter. Source: https://lindseypoulter.com/wdvp/

Sankey Diagram with annotations

Summary

  • This is a (long) iterative process of continuous refinement.

  • To make information memorable, we need to consider:

    • Audience and literacies <> chart types and texts

    • Aesthetics: how we map data into colours, shapes, sizes, position…

    • Activation: titles, labels, fonts… messages… anything we need to make sure that the message we want to convey is understood.

Source: Enrico Bertini

Data exploration/analysis

This comes at the early stages of the research process.

Here, we want data visualisations to be quick and provide insights to inform our research.

We may want to interrogate our data to:

  • Discover

  • Search

  • Query

  • Identify trends, outliers, features

  • Understand attributes (distribution, correlation…)

Source: Munzner (2015)

Example

Consider this dataset containing three columns and 1846 rows:

category  x         y
7         55.10590  46.14905
4         50.21088  80.94177
11        28.73914  62.72086
8         61.58684  11.02759
7         37.62705  93.06067
7         76.98827  54.20278
4         50.27493  49.62350
2         68.19142  45.16949
12        36.65927  17.68196
13        76.66670  75.25640
6         78.09366  77.75956
9         57.14469  79.19410
3         77.08448  51.96714
3         50.50140  17.10577
11        86.43590  59.79276
Figure 1: Table showing 15 random rows of our dataset, to get a sense of the data.

We can see there is a column called category, so we can compute summary statistics of x and y for each category:

category  mean_x    sd_x      min_x     max_x     mean_y    sd_y      min_y       max_y
1         54.26610  16.76983  15.56075  91.63996  47.83472  26.93974  0.0151193   97.47577
2         54.26692  16.77000  27.43963  77.91587  47.83160  26.93790  0.2170063   99.28376
3         54.26030  16.76774  25.44353  77.95444  47.83983  26.93019  15.7718920  94.24933
4         54.26993  16.76996  30.44965  89.50485  47.83699  26.93768  2.7347602   99.69468
5         54.26144  16.76590  22.00371  98.28812  47.83025  26.93988  10.4639152  90.45894
6         54.26881  16.76670  17.89350  96.08052  47.83545  26.94000  14.9139625  87.15221
7         54.26785  16.76676  18.10947  95.59342  47.83590  26.93610  0.3038724   99.64418
8         54.26588  16.76885  20.20978  95.26053  47.83150  26.93861  5.6457775   99.57959
9         54.26732  16.76001  21.86358  85.66476  47.83772  26.93004  16.3265464  85.57813
10        54.26873  16.76924  19.28820  91.73554  47.83082  26.93573  9.6915471   85.87623
11        54.26734  16.76896  27.02460  86.43590  47.83955  26.93027  14.3655905  92.21499
12        54.26015  16.76996  31.10687  85.44619  47.83972  26.93000  4.5776614   97.83761
13        54.26327  16.76514  22.30770  98.20510  47.83225  26.93540  2.9487000   99.48720
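A per-category summary like the one above could be computed along these lines. This is only a sketch in base R, not necessarily how the table in these slides was produced: `datasaurus_demo` is a hypothetical stand-in with the same column names but random values, and `four_stats` is a helper name introduced here for illustration.

```r
# Hypothetical stand-in for the session's dataset: same column names, random values
set.seed(1)
datasaurus_demo <- data.frame(
  category = rep(1:3, each = 50),
  x = runif(150, min = 0, max = 100),
  y = runif(150, min = 0, max = 100)
)

# Helper returning the four statistics shown in the table
four_stats <- function(v) c(mean = mean(v), sd = sd(v), min = min(v), max = max(v))

# Compute the statistics of x and y separately for each category
summary_stats <- aggregate(cbind(x, y) ~ category, data = datasaurus_demo, FUN = four_stats)

# aggregate() stores the results as matrix columns; flatten them into
# ordinary columns named x.mean, x.sd, ..., y.max
summary_stats <- do.call(data.frame, summary_stats)

print(summary_stats)
```

The result has one row per category and eight statistic columns, although the column names and order differ slightly from the table shown here.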

It is difficult to see any pattern (groups have almost identical means and standard deviations). Could data visualisations be more useful?

We can create a scatterplot and assign a colour to each category:

library(ggplot2)
# factor() makes category discrete, so each group gets its own colour
ggplot(datasaurus, aes(x = x, y = y, color = factor(category))) +
  geom_point()

Still not very useful, unless you have an extraordinary eye for patterns. Let’s try something different: let’s create small multiples per category (i.e., an array of independent plots, one for each category):

# Small multiples: facet_wrap() draws one panel per category
ggplot(datasaurus, aes(x = x, y = y, color = factor(category))) +
  geom_point() +
  facet_wrap(~category)

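This famous dataset is called the Datasaurus Dozen. It builds on the Datasaurus, constructed in 2016 by Alberto Cairo for illustrative purposes, which Matejka and Fitzmaurice (2017) extended into thirteen datasets with nearly identical summary statistics.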

Correlations

Pairwise correlations between 4 variables. Source: Turkay, Cámara, and Tripp (2023), chap 15
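A quick way to explore pairwise correlations yourself is sketched below. It uses R's built-in `iris` dataset as an assumed stand-in, chosen only because it also has four numeric variables. It is not the data behind the figure above, which may have been produced differently.

```r
# Built-in example data: four numeric measurements per flower
measurements <- iris[, 1:4]

# Pearson correlation for every pair of variables (a symmetric 4 x 4 matrix)
cor_matrix <- cor(measurements)
print(round(cor_matrix, 2))

# Scatterplot matrix: one small scatterplot per pair of variables
pairs(measurements)
```

The numeric matrix and the scatterplot matrix complement each other: the first gives a single number per pair, while the second shows whether that number hides clusters, outliers or non-linear patterns.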

Clustering / dimension reduction

Interactive cluster analysis using Orange. Source: https://orangedatamining.com/widget-catalog/bioinformatics/cluster_analysis/

Summary

Here speed matters more than aesthetics:

  • They are part of a process: they need to be easy to generate and adapt

    • Usually hand-in-hand with data manipulation

    • Integrated in the research workflow

    • Importance of programming languages (R, Python) or Visual Programming Languages (Orange)

  • They need to be understandable by the researcher/research team/reviewers

    • labels, titles… are less important here

    • academic audience only -> scientific types (e.g., boxplots…) are welcome

As artistic expression

The effects of redlining on the Black population. Source: Morrison (2017)

Joy Division’s iconic cover for their debut album (1979), created by designer Peter Saville, showing a comparison of signals from a pulsar. More info: https://www.fastcompany.com/1671015/the-data-viz-story-behind-joy-divisions-legendary-album-cover

aRtsy - R package that implements algorithms for making generative art in a straightforward and standardized manner using ‘ggplot2’.

Summary

  • Data visualisations have so much in common with design that they can become design objects by themselves!
    • Data may be informing/generating the design, but those patterns may not be explicit nor evident

As a research method

Halfway between the previous examples. This is a whole research process.

Because data vis needs data, which can be represented in different ways, and different representations convey different messages, data visualisations are excellent mechanisms to:

  • articulate discussions about which data is available or missing, how it is represented or hidden… and what it is telling us
  • enquire about data and surface key concepts that might otherwise have remained hidden
  • empower communities involved

Co-design board with stakeholders used to decide data, types, questions… Source: Cámara-Menoyo et al. (2024)

Motivation

We wanted to understand how these decisions around mapping were impacting under-represented communities.

We wanted to study a particular type of digital good: OpenStreetMap, to understand how neutrality is used to favour or hamper equity.

Catherine d’Ignazio and Lauren Klein’s (2016, 2023) Data feminism design principles

OpenStreetMap (OSM)

OSM is the largest and most exhaustive collaborative map of the world.

OSM is a map

Editing OSM Data

OSM is a database

OSM is also a community of 10,000,000 users worldwide! (mostly volunteers, but not always)

OSM’s data, contributed by 10,000,000 volunteers, complements official data sources and populates thousands of tools and services.

Major Sites: Amazon, Apple, Baidu Maps, Facebook, Microsoft, Wikipedia and Wikimedia

Transport: Air France, Alaska Airlines, Deutsche Bahn, Grab, SNCF (French rail agency), Uber

Geodata Software and Services: CARTO, Digital Globe, ESRI, Garmin, Mapbox, Telenav

Government: Agence Française de Développement, Government of Brazil, Government of Italy: President’s Office, Police Scotland, US National Park Service, US State Department, USAID, Peace Corps….

Like Wikipedia, OSM is based on principles of openness and neutrality

“OpenStreetMap maps world as it exists, and includes mapping borders and countries according to actual current situation and not a preferred or ideal situation” (OSM Wiki)

What?

  • Is OSM as neutral as it claims to be?
  • How is the notion of neutrality being implemented?
  • How is ‘neutrality’ affecting other, underrepresented demographics?

How?

Participatory research: teaming with GeoChicas to codesign data visualisations

GeoChicas is a collective of feminist women linked to OpenStreetMap, originally Spanish-speaking, who work for women’s empowerment and the reduction of the gender gap in OpenStreetMap communities and in communities associated with the world of free software and open data.

Data visualisations as a method to:

  • think about data (and representation)
  • communicate and surface controversies

Understanding community composition

Results are not representative, but they are indicative: they show a clear over-representation of certain demographics.

How do users contribute to OSM?

A dashboard showing how groups contribute to OSM

Initial findings show differences in how women contribute to OSM.

Inclusive cartographies

We are planning a series of workshops aimed at creating maps that address particular needs of underrepresented minorities, using data that is available in OpenStreetMap.

Prompts:

  • Aim: what would you want the map to help you with?

  • Iconography: how are the icons being used?

  • Information being displayed or not: what do you want to be seen? How would you like it to be represented?

Audience:

  • People from non-hegemonic demographics (women, racialized, LGBTQ+) who feel that current maps do not sufficiently address their needs.

  • Map enthusiasts, Data visualization enthusiasts and OpenStreetMap users with a keen interest in EDI issues.

  • Activists, Researchers, or people sensitive to EDI issues and inequalities (we will particularly welcome people interested in issues related to gender, race or queer topics).

You are invited!

W. E. B. Du Bois’s map of Philadelphia

Resources and Tools

Chart types

Source: Abela (2006)

Source: Financial Times (2021)

Tools

RawGraphs (<https://www.rawgraphs.io/>)

DataWrapper (<https://www.datawrapper.de/>)

Plotly’s [Chart Studio](https://chart-studio.plotly.com/)

Plotly is an open-source library for interactive visualisations using R and Python. Chart Studio provides an interface to generate visualisations and code for free.

Conclusions

After deconstructing how we understand data vis, it may seem we are back at the starting point, but we have acquired a refreshed understanding of:

  • What data visualisations are and how they can be used (in research)
    • They can be way more than mere communicative artifacts that come at the end of the process
  • Some criteria and key concepts to consider when creating good* data visualisations

Also:

  • You’ve learnt about useful tools and resources to get started with!

You’re now equipped with the basics to get started on this amazing path! (if you want to)

* Good as in “successfully achieving their goal”, not in an aesthetic sense.

Thanks!

Carlos Cámara-Menoyo

Senior Research Software Engineer

@ccamara.scholar.social

@ccamara.scholar.social.ap.brid.gy

Task

30’: Create a data visualisation for a given task

  1. Choose a dataset
  2. Think of a task you want to do with it
    1. Explore
    2. Analyse
    3. Communicate
  3. Think of a target audience you want to engage with
  4. Using software of your choice, develop a data visualisation aimed at addressing your task (#2) for the desired audience (#3)

15’: Share your experience

  1. How happy are you with the results?
  2. How easy/difficult was it to use the tool?
  3. Could you do everything you wanted to do with the tool?
  4. Is there anything you would have done differently?

Data

Software:

References

Abela, Andrew V. 2006. “Choosing a Good Chart.” https://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html.
Cámara-Menoyo, Carlos, João Porto de Albuquerque, Joanna Suchomska, Grant Tregonning, and Greg McInerny. 2024. “Co-Designing Grounded Visualisations of the Food-Water-Energy Nexus to Enable Urban Sustainability Transformations.” Environmental Science & Policy 154 (April): 103712. https://doi.org/10.1016/j.envsci.2024.103712.
d’Ignazio, Catherine, and Lauren F. Klein. 2016. “Workshop on Visualization for the Digital Humanities (VIS4DH).” Baltimore: IEEE.
d’Ignazio, Catherine, and Lauren F. Klein. 2023. Data Feminism. MIT Press.
Financial Times. 2021. “Visual Vocabulary,” February. https://github.com/Financial-Times/chart-doctor/blob/main/visual-vocabulary/FT4schools_RGS.pdf.
Kitchin, Rob. 2014. The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences. SAGE Publications Ltd. https://doi.org/10.4135/9781473909472.
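Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In ACM SIGCHI Conference on Human Factors in Computing Systems. https://www.autodesk.com/research/publications/same-stats-different-graphs.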
McInerny, Greg J. 2018. “Visualising Data: A View from Design Space.” In Routledge Handbook of Interdisciplinary Research Methods, edited by Celia Lury, Rachel Fensham, Alexandra Heller-Nicholas, Sybille Lammes, Angela Last, Mike Michael, and Emma Uprichard, 133–41. Routledge International Handbooks. London; New York: Routledge.
Morrison, Romi. 2017. “Decoding Possibilities.” https://elegantcollisions.com/decoding-possibilities.
Munzner, Tamara. 2015. Visualization Analysis and Design. A.K. Peters Visualization Series. Boca Raton: CRC Press, Taylor & Francis Group.
Turkay, Cagatay, Carlos Cámara, and James Tripp. 2023. Data Science Across Disciplines. https://warwickcim.github.io/IM939_handbook/.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.